by Bruno Yamada
This dataset is one of two that have been made publicly available for research, one for red and another for white wine samples. The white variant was chosen because it has more observations.
The two datasets are related to the red and white variants of the Portuguese “Vinho Verde” wine. Variables include objective tests (e.g. pH values), residual sugar levels after fermentation, density compared to water, and alcohol percentage, among others. They also include a quality score: each sample was evaluated by at least 3 wine experts, who gave a rating between 0 (very bad) and 10 (very excellent).
Credits to:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
This dataset has 4898 observations, with 12 variables for each observation.
Variables in the Dataset:
It seems we have a normal-like distribution. Also, there are apparently no wines with quality above 8 or below 3. Let’s confirm by improving our plot:
In fact, the experts graded no wine as deserving a quality score above 9 or below 3. Most wines scored above 5, the score expected for a wine of average quality.
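As a rough sketch of how this check could also be done numerically, the snippet below tabulates scores over the full 0 to 10 scale. The `quality` vector is synthetic stand-in data, not the wine dataset, and its probabilities are made up for illustration.

```r
# Synthetic stand-in for the quality column (NOT the real wine data)
set.seed(1)
quality <- sample(3:9, size = 500, replace = TRUE,
                  prob = c(0.01, 0.03, 0.30, 0.45, 0.17, 0.03, 0.01))

# Count samples per score over the whole 0-10 scale, so unused scores show as 0
quality_counts <- table(factor(quality, levels = 0:10))
print(quality_counts)

# Lowest and highest score actually present
quality_range <- range(quality)
print(quality_range)
```

Using `factor(..., levels = 0:10)` keeps the empty scores in the table, which makes gaps at the ends of the scale easy to spot.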
Now let’s take a look at all the variables to see their distributions:
Most variables appear to be normally distributed, while alcohol appears bimodal.
Looking at the boxplots, nearly all of the variables have some outliers, here meaning values at least three times the height of the box (the interquartile range).
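A minimal sketch of how such outliers could be counted per variable is shown below. The helper `count_outliers` and the two small columns are hypothetical and synthetic, not taken from the analysis. The multiplier `k` is an assumption: base R boxplots use 1.5 by default, while `k = 3` matches the stricter threshold described above.

```r
# Count values lying more than k * IQR beyond the first or third quartile
count_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  sum(x < q[[1]] - k * iqr | x > q[[2]] + k * iqr, na.rm = TRUE)
}

# Synthetic stand-in data (NOT the real wine data), each column has one
# deliberately extreme value
wines <- data.frame(
  residual.sugar = c(1:9, 100),
  chlorides = c(0.030, 0.035, 0.040, 0.042, 0.045,
                0.047, 0.050, 0.052, 0.055, 0.300)
)

# Apply the helper to every column with the stricter multiplier
outliers_per_variable <- sapply(wines, count_outliers, k = 3)
print(outliers_per_variable)
```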
Let’s take a look at a zoomed version for residual.sugar and its outliers:
So we can see that although most wines are in the 1 to 20 range, a couple of wines have a value above 30, and one goes above 65. According to the dataset description, a value around 45 corresponds to a wine considered sweet.
Chlorides, Volatile Acidity and Free Sulfur Dioxide are the variables with the most outliers, as can be seen in the following boxplots:
Our dataset has 4898 wines with 12 variables each. None of the variables is categorical, so we have no ordered factor variable.
Other Observations:
- No wine has a score below 3 or above 9 (the scale ranges from 0 to 10)
- Most variables have outliers, with some lying more than 20 times the interquartile range away (chlorides)
quality - as it is the overall score for each given wine
alcohol - because wine is an alcoholic beverage, although alcohol might not be entirely correlated with the overall quality
density - due to the low number of outliers and the shape of its histogram, which might suggest some correlation with quality
All of the variables appear to be fairly independent from one another, except for quality, which could be related to all of the other variables or to a combination of them.
No feature was unusually distributed; most seemed to resemble normal or long-tailed negative binomial distributions.
First let us take a look at how our variables are related to one another:
This plot was created using the ggpairs function from the GGally package, which builds on ggplot2.
All the variables except for one show low correlation scores. Here are some points of interest:
Interpreting the Scores:
As most variables do not have high correlation scores with one another,
they could be used with algorithms whose inductive bias is that variables are
independent from one another.
A high correlation score with another variable can be an indicator that we
could use the first variable as an input to predict the other.
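To illustrate how such scores can be computed and screened, here is a small sketch using base R's `cor`. The data frame is synthetic (density is constructed to rise with residual sugar and fall with alcohol), and the 0.5 threshold for a "high" score is an arbitrary assumption, not a value from the analysis.

```r
# Synthetic stand-in data (NOT the real wine data): density is built to rise
# with residual sugar and fall with alcohol, purely for illustration
set.seed(42)
n <- 300
wines <- data.frame(
  residual.sugar = runif(n, 1, 20),
  alcohol = runif(n, 8, 14)
)
wines$density <- 0.99 + 0.0004 * wines$residual.sugar -
  0.001 * (wines$alcohol - 10) + rnorm(n, sd = 0.0005)

# Pearson correlation matrix, rounded to two decimals
cor_matrix <- round(cor(wines), 2)
print(cor_matrix)

# List each variable pair once (upper triangle) if its absolute
# correlation exceeds the chosen threshold
threshold <- 0.5  # hypothetical cut-off for a "high" correlation
pairs_idx <- which(abs(cor_matrix) > threshold & upper.tri(cor_matrix),
                   arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_matrix)[pairs_idx[, "row"]],
  var2 = colnames(cor_matrix)[pairs_idx[, "col"]],
  r = cor_matrix[pairs_idx]
)
print(strong_pairs)
```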
Density showed a high correlation of 0.83 with residual sugar, indicating that the sweeter the wine, the higher its density, as shown by the red line below.
It might not be as linearly distributed as we expect, but still the correlation is clear.
On the other hand, with a correlation score of -0.78, alcohol has an inverse relationship with density: the higher the percentage of alcohol, the lower the density of the wine. This is curious at first, but it makes sense when you think about it, since ethanol is less dense than water.
Free and total sulfur dioxide had a correlation score of 0.61, which is to be expected, as total sulfur dioxide is composed of free and bound sulfur dioxide. With a score of 0.61, though, you can see that the data points are more spread out around the red line.
Quality had no strong correlation with any other variable, but two were noticeable:
And we start to connect the dots: wines with higher scores were correlated with higher alcohol percentages (a low correlation, but still noticeable), and the more alcohol a wine has, the lower its density usually is. By the same token, a wine with a high density would tend to have less alcohol and thus a lower quality score.
In this section we saw that our feature of interest, quality, had a lower than expected correlation with most variables. Noticeably, though, it had a positive correlation with alcohol and a negative correlation with density, the other two features chosen as worth paying attention to.
Density and alcohol had a high negative correlation score, meaning that the higher one gets, the lower the other.
Density and residual sugar had a high correlation score.
Free sulfur dioxide and total sulfur dioxide had a high correlation score, since the former is a component of the latter, although the score was still lower than expected when you think about it that way.
Density and residual sugar had the strongest relationship, although
for our observations the most important relationship was between quality and
alcohol.
Now let’s investigate multiple variables combined.
So, based on the previous plots, we could assume that to get a better quality wine, we need:
- a higher alcohol percentage
- a lower wine density
Let’s plot a graph to check:
As noted before, most wines are of average quality, so when we colored the
points by quality, the plot got a little confusing. If we change our gradient
by adding a third color, and use a darker theme, it becomes clearer:
We also trimmed the plot to the 99.9% quantile of our data, which removes a few extreme outliers and gives a cleaner graph.
Now we can clearly see where most of the higher quality wines are located, and it
appears we were right:
Wines with higher alcohol percentages and lower density tend to be of
higher quality, as shown by the points that tend towards blue.
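The 99.9% quantile trimming mentioned above could look roughly like the sketch below. It assumes the cut is applied to density, which the text does not state, and it runs on synthetic data with one deliberately extreme value.

```r
# Synthetic stand-in data (NOT the real wine data) with one extreme density
set.seed(7)
wines <- data.frame(
  alcohol = runif(1000, 8, 14),
  density = c(rnorm(999, mean = 0.994, sd = 0.003), 1.039)
)

# Value below which 99.9% of the densities fall
density_cutoff <- quantile(wines$density, probs = 0.999)

# Keep only the rows at or below that cutoff before plotting
wines_trimmed <- subset(wines, density <= density_cutoff)

# Number of rows dropped by the trimming
print(nrow(wines) - nrow(wines_trimmed))
```

Trimming this way drops only the most extreme points, so the bulk of the scatter plot is no longer squeezed into a corner by a handful of outliers.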
But is that all?
If alcohol and density have such noticeable correlations with quality, positive or negative, we could take a look at the other features and their correlations with density or alcohol.
Let us check some features which show high correlations with alcohol:
So, chlorides do not appear to affect wine quality as long as they stay around 0.05.
But a high chloride level (amount of salt) can definitely reduce the quality of the wine:
as shown by the red points, nearly all points with chlorides above 0.10 are
average or lower quality wines.
Total Sulfur Dioxide can also somewhat influence the quality, although not as
much as chlorides. As long as it stays somewhere below 220, it should have little effect on wine quality, but a higher amount will affect the quality, which is in line with the feature description:
“… in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.”
Residual sugar has a high correlation with alcohol and can, to some extent, affect the quality. As seen in the graph, values higher than 15 can reduce the quality; below that, it does not appear to have much of an effect.
Lastly, the plot below showed an interesting pattern:
We know from previous plots that wine quality decreases as density increases. But when we paired density with residual sugar, we could clearly see that good wines have a sort of sweet spot in the relationship between the two. If you focus only on the blue dots, you can see that as density increases, quality is kept at roughly the same level if residual sugar increases too. As seen with previous features, though, the pattern dissipates once you go past a certain threshold.
Right off the bat, I would guess we probably do not have enough observations to create an effective model, since the majority of the wines are of average quality, as noted previously in the Wine Quality Histogram. Still, we can try creating a linear model to validate our hypothesis:
library(caret)
set.seed(42)

# `df` is assumed to hold the white wine data frame loaded earlier
# Select columns according to our exploratory analysis
feature_columns <- c(
  'alcohol',
  'density',
  'total.sulfur.dioxide',
  'chlorides',
  'residual.sugar'
)
train_data <- subset(df, select = feature_columns)
train_data$quality <- df$quality

# Take 70% of our data as training data and keep the other 30% for testing
split_index <- sample(seq_len(nrow(df)), floor(0.7 * nrow(df)))
df_train <- train_data[split_index, ]
df_test <- train_data[-split_index, ]

# Train the linear model
model <- train(quality ~ ., data = df_train, method = 'lm')

# Predict the test samples
predicteds <- predict(model, df_test)

# Create a new data frame for further use containing the actual quality,
# the prediction, and the absolute error of each test sample
actuals_preds <- data.frame(
  actuals = df_test$quality,
  predicteds = predicteds,
  errors = abs(df_test$quality - predicteds)
)
Now let’s take a look at the prediction errors:
We set aside 30% of our dataset as a test set and did not use it to train our linear model. With the model trained on the other 70%, we predicted the wine quality for the samples in the test set and calculated the error for each sample as the absolute difference between the predicted value and the actual quality of that sample.
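One way to put these per-sample errors in context is to compare their mean with that of a naive baseline that always predicts the average training score. The sketch below is an illustration only: it runs on synthetic data, uses made-up coefficients, and swaps in base R's `lm` for caret's `train(method = 'lm')` so that it has no package dependencies.

```r
# Synthetic stand-in data (NOT the real wine data): quality is built to rise
# with alcohol, with made-up coefficients, purely for illustration
set.seed(42)
n <- 400
wines <- data.frame(
  alcohol = runif(n, 8, 14),
  density = rnorm(n, mean = 0.994, sd = 0.003)
)
raw_score <- 0.5 + 0.5 * wines$alcohol + rnorm(n, sd = 0.5)
wines$quality <- round(pmin(9, pmax(3, raw_score)))

# 70/30 train/test split, as in the analysis
split_index <- sample(seq_len(n), floor(0.7 * n))
df_train <- wines[split_index, ]
df_test <- wines[-split_index, ]

# Plain linear model (base lm used here instead of caret)
fit <- lm(quality ~ ., data = df_train)
predicted <- predict(fit, newdata = df_test)

# Absolute error per test sample and its mean (MAE)
errors <- abs(df_test$quality - predicted)
mae_model <- mean(errors)

# Baseline: always predict the mean quality of the training set
mae_baseline <- mean(abs(df_test$quality - mean(df_train$quality)))

print(c(model = mae_model, baseline = mae_baseline))
```

If the model's mean absolute error is not clearly below the baseline's, the model has learned little beyond the overall average score.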
So, as suspected, we simply don’t have enough observations of high-scoring wines (above 8) or low-scoring ones (below 3).
Alcohol and Residual Sugar had a negative correlation, meaning the lower one is, the higher the other. This makes sense, since a higher alcohol percentage means more sugar went through fermentation.
Density and Residual Sugar had a strong relationship, meaning more sugar made the wine denser.
It was interesting to see the relationship for chlorides (amount of salt), where values below a certain threshold could help improve the quality, but past that mark they definitely had a negative effect.
Due to the distribution of the dataset, the model can simply learn to guess 6 all the time and be right more often than with any other guess, since 6 is the most common score in the dataset.
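The "always guess the most common score" strategy described above can be quantified with a few lines of R. The counts below are invented for illustration and are not the real distribution of the wine data.

```r
# Synthetic stand-in for the quality column (NOT the real wine data):
# 5 wines scored 4, 30 scored 5, 45 scored 6, 18 scored 7, 2 scored 8
quality <- rep(c(4, 5, 6, 7, 8), times = c(5, 30, 45, 18, 2))

# Most frequent score in the data
majority_score <- as.numeric(names(which.max(table(quality))))
print(majority_score)  # 6

# Share of samples a constant guess of that score would get right
majority_accuracy <- mean(quality == majority_score)
print(majority_accuracy)  # 0.45
```

Any trained model should at least beat this constant guess before its predictions are worth interpreting.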
One alternative would be to balance the dataset by taking, for each quality level, only as many samples as the least frequent quality level has. But then we simply wouldn’t have enough samples left to use in our linear model.
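A minimal sketch of that balancing idea (downsampling every quality level to the size of the rarest one) is shown below, again on invented counts rather than the real data. It also shows how few rows survive.

```r
# Synthetic stand-in data (NOT the real wine data) with very uneven classes
wines <- data.frame(
  quality = rep(c(4, 5, 6, 7, 8), times = c(5, 30, 45, 18, 2)),
  alcohol = seq(8, 14, length.out = 100)
)

set.seed(42)

# Size of the least frequent quality level
n_min <- min(table(wines$quality))

# For each quality level, draw n_min row indices at random
rows_by_quality <- split(seq_len(nrow(wines)), wines$quality)
keep <- unlist(lapply(rows_by_quality, function(rows) {
  rows[sample.int(length(rows), n_min)]
}))

balanced <- wines[keep, ]
print(table(balanced$quality))
print(nrow(balanced))  # 10 rows left out of 100
```

With the rarest level holding only 2 samples here, the balanced set shrinks from 100 rows to 10, which mirrors the concern raised above.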
Throughout this analysis, I found the two main features for predicting wine quality: alcohol and density. When it came time to train the model, I discovered that the distribution of our dataset didn’t really give enough data points for predicting either high or low quality wine, since most wines were of average quality.
The following plots are meant to show these findings in greater detail.
By analyzing the graph output by ggpairs, we found two main features for predicting wine quality: alcohol and density. We knew their correlation score was negative, so as one increases, the other decreases. To have a good wine, we would therefore mainly need a higher alcohol percentage and a lower density.
With a proper choice of point colors and background, and after removing a few outliers by taking the quantile representing 99.9% of our data, we arrived at the above plot, where you can clearly see that better wines (blue dots) have high alcohol percentages and low density.
When it came time to train a model, I was soon worried by the distribution of the observations in this dataset. Ideally, we would have an equal number of good, average and bad wines, but this dataset was mostly composed of average wines (quality from 5 to 7), and I was worried we wouldn’t have enough high quality samples to figure out what it really takes to make a high quality wine. In the above plot you can really see how our samples are spread across quality levels.
We trained a model despite the hypothesis that we didn’t have enough data, and it was interesting to see the model act as we expected, as the plot shows.
For this project, there were two datasets to choose from, white wine and red wine. I chose white wine because it had the most samples, and I figured it was a better representation of real-world white wine distributions (which it probably was). The problem was that it was not an equally distributed dataset, nor did it contain enough samples for me to balance it and still have enough left for a prediction model.
When starting the analysis, the most important plot was the one resulting from the ggpairs function. It gave us a very useful, summarized view of all the relationships between the variables, and from there I could further investigate, create and validate hypotheses.
Through this analysis I found that the right choice of plot type, colors, titles, and legends can really improve the quality of the analysis and helps convey the insights found, so it’s definitely worth the time it takes to make a polished graph.
For future work with this dataset, it would really help to have more high quality wines, so that it would be easier to identify the features that contribute most to achieving that quality level.